import pandas as pd
import numpy as np
import plotly.express as pxWorkshop: Creating and Transforming Variables
Introduction
For this workshop, you’ll be cleaning and modifying a dataset with functions like drop(), rename(), to_numeric(), and replace(). These have been covered in the last few weeks of the prework.
Step 1: Load packages and data
To get started, load in the needed packages: pandas, numpy, and plotly.express.
Now, read in the dataset called physical_activity.csv from your data folder and save it as pa.
It should have 142 rows and 9 columns.
# reading the physical_activity csv
pa = pd.read_csv("Data/physical_activity.csv")
pa| personal_id | household_id | sex | status | bmi | sedentary_ap_s_day | light_ap_s_day | mvpa_s_day | oms_recommendation | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | M | Adulte | 32.2 | 41483 | 5782.571429 | 816.5714286 | No |
| 1 | 2 | 2 | F | Adulte | 26.1 | 43091 | 4102.571429 | 486.7142857 | No |
| 2 | 3 | 2 | F | Enfant | 16.7 | 38630.71429 | 1874.571429 | 2806 | No |
| 3 | 4 | 2 | M | Enfant | 17.1 | 40290.28571 | 3196 | 4712 | Yes |
| 4 | 5 | 3 | M | Adulte | missing | 73121.57143 | 863.2857143 | 267.5714286 | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 137 | 138 | 39 | F | Enfant | 20.1 | 41610 | 3426.285714 | 1378.428571 | No |
| 138 | 139 | 39 | F | Enfant | 19.8 | 35515.14286 | 4281.285714 | 2385.857143 | No |
| 139 | 140 | 40 | M | Adulte | 37.1 | 38259.28571 | 4905.285714 | 1153.857143 | Yes |
| 140 | 141 | 40 | F | Adulte | 37.6 | 44304.85714 | 5039.857143 | 664.2857143 | No |
| 141 | 142 | 40 | F | Enfant | 35.6 | 45467.42857 | 3039.857143 | 1137.428571 | No |
142 rows × 9 columns
Step 2: Subset and rename variables
Next, drop the “personal_id” and “household_id” variables from the DataFrame.
# dropping the personal_id and household_id variables using drop:
pa = pa.drop(columns = ["personal_id","household_id"])
pa| sex | status | bmi | sedentary_ap_s_day | light_ap_s_day | mvpa_s_day | oms_recommendation | |
|---|---|---|---|---|---|---|---|
| 0 | M | Adulte | 32.2 | 41483 | 5782.571429 | 816.5714286 | No |
| 1 | F | Adulte | 26.1 | 43091 | 4102.571429 | 486.7142857 | No |
| 2 | F | Enfant | 16.7 | 38630.71429 | 1874.571429 | 2806 | No |
| 3 | M | Enfant | 17.1 | 40290.28571 | 3196 | 4712 | Yes |
| 4 | M | Adulte | missing | 73121.57143 | 863.2857143 | 267.5714286 | No |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 137 | F | Enfant | 20.1 | 41610 | 3426.285714 | 1378.428571 | No |
| 138 | F | Enfant | 19.8 | 35515.14286 | 4281.285714 | 2385.857143 | No |
| 139 | M | Adulte | 37.1 | 38259.28571 | 4905.285714 | 1153.857143 | Yes |
| 140 | F | Adulte | 37.6 | 44304.85714 | 5039.857143 | 664.2857143 | No |
| 141 | F | Enfant | 35.6 | 45467.42857 | 3039.857143 | 1137.428571 | No |
142 rows × 7 columns
Now, rename oms_recommendation to who_recommendation. (OMS is the French name for WHO.)
# renaming the `oms_recommendation` to `who_recommendation`. (OMS is the French name for WHO.)
pa = pa.rename(columns={"oms_recommendation": "who_recommendation"})
pa | sex | status | bmi | sedentary_ap_s_day | light_ap_s_day | mvpa_s_day | who_recommendation | |
|---|---|---|---|---|---|---|---|
| 0 | M | Adulte | 32.2 | 41483 | 5782.571429 | 816.5714286 | No |
| 1 | F | Adulte | 26.1 | 43091 | 4102.571429 | 486.7142857 | No |
| 2 | F | Enfant | 16.7 | 38630.71429 | 1874.571429 | 2806 | No |
| 3 | M | Enfant | 17.1 | 40290.28571 | 3196 | 4712 | Yes |
| 4 | M | Adulte | missing | 73121.57143 | 863.2857143 | 267.5714286 | No |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 137 | F | Enfant | 20.1 | 41610 | 3426.285714 | 1378.428571 | No |
| 138 | F | Enfant | 19.8 | 35515.14286 | 4281.285714 | 2385.857143 | No |
| 139 | M | Adulte | 37.1 | 38259.28571 | 4905.285714 | 1153.857143 | Yes |
| 140 | F | Adulte | 37.6 | 44304.85714 | 5039.857143 | 664.2857143 | No |
| 141 | F | Enfant | 35.6 | 45467.42857 | 3039.857143 | 1137.428571 | No |
142 rows × 7 columns
Your data should now have 7 variables:
sex: male or femalestatus: adult (Adulte) or child (Enfant)bmi: body mass indexsedentary_ap_s_day: sedentary behaviour (seconds/day)light_ap_s_day: light physical activity (seconds/day)mvpa_s_day: moderate to vigorous physical activity (seconds/day)who_recommendation: whether or not WHO recommendations for physical activity are met
Step 3: Examine and convert data types
Let’s think about the types of variables in our dataset.
Check the data types that were automatically assigned to each variable with info().
# your code here
pa.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142 entries, 0 to 141
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sex 142 non-null object
1 status 142 non-null object
2 bmi 142 non-null object
3 sedentary_ap_s_day 142 non-null object
4 light_ap_s_day 142 non-null object
5 mvpa_s_day 142 non-null object
6 who_recommendation 133 non-null object
dtypes: object(7)
memory usage: 7.9+ KB
Notice that all of your variables are of type object (which is a catch-all for strings), but some should be numeric (e.g., bmi).
This is because the numeric variables have the words “missing” in them to indicate that the value is missing. Can you spot those words when you view the dataset?
We’ll use the pd.to_numeric() function to convert these variables to numeric.
Here is an example of how to use this.
test_df = pd.DataFrame({"bmi": ["32.2", "missing", "18.8", "17.8", "21.1"]})
test_df["bmi"] = pd.to_numeric(test_df["bmi"], errors="coerce")
test_df| bmi | |
|---|---|
| 0 | 32.2 |
| 1 | NaN |
| 2 | 18.8 |
| 3 | 17.8 |
| 4 | 21.1 |
The missing values are represented as NaN after the conversion.
Now, fill in the blanks below to apply this to your pa DataFrame.
# Change variables' type
pa["bmi"] = pd.to_numeric(pa["bmi"], errors="coerce")
pa["sedentary_ap_s_day"] = pd.to_numeric(pa["sedentary_ap_s_day"], errors="coerce")
pa["light_ap_s_day"] = pd.to_numeric(pa["light_ap_s_day"], errors="coerce")
pa["mvpa_s_day"] = pd.to_numeric(pa["mvpa_s_day"], errors="coerce")Check the data types again with info().
Step 4: Recode the status variable
Next, notice that the status variable currently uses the French terms “Adulte” and “Enfant”.
# Run this code to check the values of status
pa["status"].value_counts()status
Adulte 72
Enfant 70
Name: count, dtype: int64
Use the .replace() method to transform the status variable by replacing “Adulte” with “Adult” and “Enfant” with “Child”.
# changing the values Adulte -> Adult and Enfant -> Child
pa["status"] = pa["status"].replace({"Adulte": "Adult",
"Enfant": "Child"})
pa["status"].value_counts()status
Adult 72
Child 70
Name: count, dtype: int64
Step 5: Convert the physical activity variables
There are 3 physical activity variables: sedentary behavior (sedentary_ap_s_day), light physical activity (light_ap_s_day), and moderate to vigorous physical activity (mvpa_s_day).
These variables are currently measured in seconds per day.
However, most recommendations for physical activity are in minutes per week, so we want to align with these measures.
To do this, complete the following manipulations:
Use arithmetic operations to create new variables called
sedentary_ap_m_wk,light_ap_m_wk, andmvpa_m_wkthat are in minutes per week.Remove the previous seconds per day variables.
# Converting the secs/day to mins/ wk using : (x/60)*7
pa["sedentary_ap_m_wk"] = (pa["sedentary_ap_s_day"]/60) * 7
pa["light_ap_m_wk"] = (pa["light_ap_s_day"]/60) * 7
pa["mvpa_m_wk"] = (pa["mvpa_s_day"]/60) * 7
pa| sex | status | bmi | sedentary_ap_s_day | light_ap_s_day | mvpa_s_day | who_recommendation | sedentary_ap_m_wk | light_ap_m_wk | mvpa_m_wk | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | Adult | 32.2 | 41483.00000 | 5782.571429 | 816.571429 | No | 4839.683333 | 674.633333 | 95.266667 |
| 1 | F | Adult | 26.1 | 43091.00000 | 4102.571429 | 486.714286 | No | 5027.283333 | 478.633333 | 56.783333 |
| 2 | F | Child | 16.7 | 38630.71429 | 1874.571429 | 2806.000000 | No | 4506.916667 | 218.700000 | 327.366667 |
| 3 | M | Child | 17.1 | 40290.28571 | 3196.000000 | 4712.000000 | Yes | 4700.533333 | 372.866667 | 549.733333 |
| 4 | M | Adult | NaN | 73121.57143 | 863.285714 | 267.571429 | No | 8530.850000 | 100.716667 | 31.216667 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 137 | F | Child | 20.1 | 41610.00000 | 3426.285714 | 1378.428571 | No | 4854.500000 | 399.733333 | 160.816667 |
| 138 | F | Child | 19.8 | 35515.14286 | 4281.285714 | 2385.857143 | No | 4143.433334 | 499.483333 | 278.350000 |
| 139 | M | Adult | 37.1 | 38259.28571 | 4905.285714 | 1153.857143 | Yes | 4463.583333 | 572.283333 | 134.616667 |
| 140 | F | Adult | 37.6 | 44304.85714 | 5039.857143 | 664.285714 | No | 5168.900000 | 587.983333 | 77.500000 |
| 141 | F | Child | 35.6 | 45467.42857 | 3039.857143 | 1137.428571 | No | 5304.533333 | 354.650000 | 132.700000 |
142 rows × 10 columns
# Dropping the s_day variables:
pa = pa.drop(columns=["sedentary_ap_s_day", "light_ap_s_day", "mvpa_s_day"])
pa| sex | status | bmi | who_recommendation | sedentary_ap_m_wk | light_ap_m_wk | mvpa_m_wk | |
|---|---|---|---|---|---|---|---|
| 0 | M | Adult | 32.2 | No | 4839.683333 | 674.633333 | 95.266667 |
| 1 | F | Adult | 26.1 | No | 5027.283333 | 478.633333 | 56.783333 |
| 2 | F | Child | 16.7 | No | 4506.916667 | 218.700000 | 327.366667 |
| 3 | M | Child | 17.1 | Yes | 4700.533333 | 372.866667 | 549.733333 |
| 4 | M | Adult | NaN | No | 8530.850000 | 100.716667 | 31.216667 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 137 | F | Child | 20.1 | No | 4854.500000 | 399.733333 | 160.816667 |
| 138 | F | Child | 19.8 | No | 4143.433334 | 499.483333 | 278.350000 |
| 139 | M | Adult | 37.1 | Yes | 4463.583333 | 572.283333 | 134.616667 |
| 140 | F | Adult | 37.6 | No | 5168.900000 | 587.983333 | 77.500000 |
| 141 | F | Child | 35.6 | No | 5304.533333 | 354.650000 | 132.700000 |
142 rows × 7 columns
Step 6: Total physical activity
Create a new column that adds light physical activity and moderate to vigorous physical activity, and call it total_pa_m_wk.
This should give us the total amount of activity in minutes per week.
# Adding "light_ap_m_wk" and mvpa_m_k
pa["total_pa_m_wk"] = pa["light_ap_m_wk"] + pa["mvpa_m_wk"]
pa| sex | status | bmi | who_recommendation | sedentary_ap_m_wk | light_ap_m_wk | mvpa_m_wk | total_pa_m_wk | |
|---|---|---|---|---|---|---|---|---|
| 0 | M | Adult | 32.2 | No | 4839.683333 | 674.633333 | 95.266667 | 769.900000 |
| 1 | F | Adult | 26.1 | No | 5027.283333 | 478.633333 | 56.783333 | 535.416667 |
| 2 | F | Child | 16.7 | No | 4506.916667 | 218.700000 | 327.366667 | 546.066667 |
| 3 | M | Child | 17.1 | Yes | 4700.533333 | 372.866667 | 549.733333 | 922.600000 |
| 4 | M | Adult | NaN | No | 8530.850000 | 100.716667 | 31.216667 | 131.933333 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 137 | F | Child | 20.1 | No | 4854.500000 | 399.733333 | 160.816667 | 560.550000 |
| 138 | F | Child | 19.8 | No | 4143.433334 | 499.483333 | 278.350000 | 777.833333 |
| 139 | M | Adult | 37.1 | Yes | 4463.583333 | 572.283333 | 134.616667 | 706.900000 |
| 140 | F | Adult | 37.6 | No | 5168.900000 | 587.983333 | 77.500000 | 665.483333 |
| 141 | F | Child | 35.6 | No | 5304.533333 | 354.650000 | 132.700000 | 487.350000 |
142 rows × 8 columns
Step 7: Categorical variable for total physical activity
Write a custom function that bins its input variable into three categories:
low: less than 500 minutes per week (< 500)medium: between 500 and 1000 minutes per week (>= 500 and < 1000)high: greater than 1000 minutes per week (>= 1000)
Apply this function to the pa DataFrame to create a categorical variable called total_pa_cat
# defining a ftn to return low, medium and high:
def activity_cat (total_phy_activity):
if total_phy_activity < 500:
return "low"
elif total_phy_activity >= 500 and total_phy_activity < 1000:
return "medium"
elif total_phy_activity >= 1000:
return "high"
else:
return "ungraded"
activity_cat_vec = np.vectorize(activity_cat)# creating a new variable total_pa_cat using the vectorized ftn :
pa["total_pa_cat"] = activity_cat_vec(pa["total_pa_m_wk"])
pa | sex | status | bmi | who_recommendation | sedentary_ap_m_wk | light_ap_m_wk | mvpa_m_wk | total_pa_m_wk | total_pa_cat | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | Adult | 32.2 | No | 4839.683333 | 674.633333 | 95.266667 | 769.900000 | medium |
| 1 | F | Adult | 26.1 | No | 5027.283333 | 478.633333 | 56.783333 | 535.416667 | medium |
| 2 | F | Child | 16.7 | No | 4506.916667 | 218.700000 | 327.366667 | 546.066667 | medium |
| 3 | M | Child | 17.1 | Yes | 4700.533333 | 372.866667 | 549.733333 | 922.600000 | medium |
| 4 | M | Adult | NaN | No | 8530.850000 | 100.716667 | 31.216667 | 131.933333 | low |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 137 | F | Child | 20.1 | No | 4854.500000 | 399.733333 | 160.816667 | 560.550000 | medium |
| 138 | F | Child | 19.8 | No | 4143.433334 | 499.483333 | 278.350000 | 777.833333 | medium |
| 139 | M | Adult | 37.1 | Yes | 4463.583333 | 572.283333 | 134.616667 | 706.900000 | medium |
| 140 | F | Adult | 37.6 | No | 5168.900000 | 587.983333 | 77.500000 | 665.483333 | medium |
| 141 | F | Child | 35.6 | No | 5304.533333 | 354.650000 | 132.700000 | 487.350000 | low |
142 rows × 9 columns
Step 8: Visualize data with plotly express
In this section you will use pa to generate some plots!
Plot 1: Histogram
Histograms are best used to visualize the distribution of a single numeric (continuous) variable. Choose a variable from the dataset that you can plot a histogram with.
Using plotly express, create and print a histogram.
# creating a histogram using the total activity per minute per week
pa_plot = px.histogram(
pa,
x = "total_pa_m_wk",
labels= {"total_pa_m_wk": "Total Physical Activity per hr per week"},
title = "Distibution of the total physical activity done by each respondent per hr per week")
pa_plot- Moderate physical activity (400-700 hours per week) is the most common.
- Very high physical activity levels are rare but present.
- The distribution suggests that extreme physical activity levels are uncommon.
Plot 2: Boxplot
Boxplots are best used to visualize the distribution of a numeric variable, split by the values of a discrete/categorical variable.
Use plotly express to create and print a boxplot.
Use the same variable used for your histogram on your y-axis, and plot it against a categorical variable on your x-axis.
Set fill color to match the values of your categorical variable.
# creating a boxplot disaggregated by respondent's status:
status_plot = px.box(pa,
y = "total_pa_m_wk",
x = "status",
color = "status",
title = "Distribution of the total weekly physical activity by respondent status")
status_plotWrite a short explanation of what you see in the plot. - Adults generally have a higher median and more variable weekly physical activity compared to children. - Children’s activity levels are more consistent, but some outliers indicate highly active or inactive individuals. - Both groups have extreme cases of individuals engaging in very high weekly physical activity.the adults.
Optional Challenge Question: Health Risk Classification
Write a function
assess_risk()that:- Returns
"High Risk"if BMI ≥ 30 and total physical activity < 500 min/week
- Otherwise, returns
"Not High Risk"
- Returns
Create a new column
risk_categoryin yourpaDataFrame by applying this function.Make a percentage stacked bar chart showing the count of
"High Risk"and"Not High Risk"individuals, grouped bysex.
Comment on the findings. Which gender has a higher proportion of high risk individuals?
# Creating a ftn that assess rick based on BMI and total physical activity:
def assess_risk(bmi_score, total_pa):
if bmi_score >= 30 and total_pa < 500:
return "High Risk"
else:
return "Not High Risk"
assess_risk_vec = np.vectorize(assess_risk)pa["risk_category"] = assess_risk_vec(pa["bmi"], pa["total_pa_m_wk"])
pa| sex | status | bmi | who_recommendation | sedentary_ap_m_wk | light_ap_m_wk | mvpa_m_wk | total_pa_m_wk | total_pa_cat | risk_category | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | Adult | 32.2 | No | 4839.683333 | 674.633333 | 95.266667 | 769.900000 | medium | Not High Risk |
| 1 | F | Adult | 26.1 | No | 5027.283333 | 478.633333 | 56.783333 | 535.416667 | medium | Not High Risk |
| 2 | F | Child | 16.7 | No | 4506.916667 | 218.700000 | 327.366667 | 546.066667 | medium | Not High Risk |
| 3 | M | Child | 17.1 | Yes | 4700.533333 | 372.866667 | 549.733333 | 922.600000 | medium | Not High Risk |
| 4 | M | Adult | NaN | No | 8530.850000 | 100.716667 | 31.216667 | 131.933333 | low | Not High Risk |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 137 | F | Child | 20.1 | No | 4854.500000 | 399.733333 | 160.816667 | 560.550000 | medium | Not High Risk |
| 138 | F | Child | 19.8 | No | 4143.433334 | 499.483333 | 278.350000 | 777.833333 | medium | Not High Risk |
| 139 | M | Adult | 37.1 | Yes | 4463.583333 | 572.283333 | 134.616667 | 706.900000 | medium | Not High Risk |
| 140 | F | Adult | 37.6 | No | 5168.900000 | 587.983333 | 77.500000 | 665.483333 | medium | Not High Risk |
| 141 | F | Child | 35.6 | No | 5304.533333 | 354.650000 | 132.700000 | 487.350000 | low | High Risk |
142 rows × 10 columns
# stacked bar chart of the risk category by sex:
risk_cat_plot = px.histogram(
pa,
x = "risk_category",
color= "sex",
barnorm = "percent",
text_auto = ".1f"
)
risk_cat_plot- In the Not High Risk group, males and females are nearly equally represented.
- In the High Risk group, females significantly outnumber males, making up nearly 73% of this category.
- This may indicate that females are more likely to be classified as high risk compared to males in this dataset.